Real Estate Dataset -Exploratory and Descriptive Analysis

Author
Affiliation

Natacha Iradukunda and Gemima Grace Wishavura

Junior Data Analysts

Published

June 28, 2025

In this notebook, we carry out an in-depth exploratory and descriptive analysis of Real Estate Dataset, a widely used dataset for income prediction tasks based on assesed value,sales ratio ans sales amount attributes.

This phase of analysis is essential for uncovering patterns, detecting potential biases, and gaining intuition about the dataset’s structure before applying any modelling procedures. We examine the distribution of key numerical and categorical variables, investigate relationships between assesed value and sales amount, and use visualizations to summarize insights.

We begin our analysis by importing the core Python libraries required for data handling, numerical computation, visualization, and directory management:

Code
import pandas as pd 
import numpy as np 
import os 
import plotly.express as px
import matplotlib.pyplot as plt
import seaborn as sns

1 Define and Create Directory Paths

To ensure reproducibility and organized storage, we programmatically create directories if they don’t already exist for:

  • raw data
  • processed data
  • results
  • documentation

These directories will store intermediate and final outputs for reproducibility.

2 Loading the Cleaned Dataset

We load the cleaned version of the UCI Adult Income Dataset from the processed data directory into a Pandas DataFrame. The head(10) function shows the first ten records, giving a glimpse into the data columns such as Town, Address, Property Type, etc.

Code
real_estate_filename = os.path.join(processed_dir, 'real_estate_cleaned.csv')
real_estate = pd.read_csv(real_estate_filename)
real_estate
Adult Income Dataset
Serial Number List Year Date Recorded Town Address Assessed Value Sale Amount Sales Ratio Property Type Residential Type
0 160091 2016 12/23/2016 Avon 2 EDGEWOOD 143390 224000.0 0.640134 Condo appartments
1 160172 2016 4/13/2017 Bethel 66 H NASHVILLE ROAD 80500 130000.0 0.619231 Condo appartments
2 160258 2016 6/29/2017 Bethel 1 EAGLE ROCK HILL 117180 200000.0 0.585900 Condo appartments
3 160103 2016 11/30/2016 Branford 137 PEDDLARS DR 106700 155000.0 0.688387 Condo appartments
4 167930 2016 9/27/2017 Bridgeport 95 LANCE CIR 105820 148000.0 0.715000 Condo appartments
... ... ... ... ... ... ... ... ... ... ...
143962 190120 2019 4/28/2020 Rocky Hill 174 VALLEY CREST DRIVE 174650 242000.0 0.721700 Single Family single-residence
143963 190460 2019 8/17/2020 Southbury 10 CONCORD CT 300250 412000.0 0.728800 Condo appartments
143964 19200 2019 6/29/2020 Monroe 39 FARM VIEW RD 238400 377400.0 0.631700 Single Family single-residence
143965 190480 2019 2/25/2020 New Haven 498 HOWARD AV 110740 248000.0 0.446500 Two Family duplex-residence
143966 19022 2019 10/17/2019 Watertown 305 BEACH AVE 192800 235000.0 0.820400 Single Family single-residence

143967 rows × 10 columns

Code
real_estate_filename = os.path.join(processed_dir, 'real_estate_cleaned.csv')
real_estate = pd.read_csv(real_estate_filename)
real_estate

3 Dataset Dimensions and Data Types

Here, we examine the structure of the dataset:

  • There are 143,967 entries and 10 variables.
  • The dataset includes both numerical (e.g., list year, dates recorded) and categorical variables (e.g., town, property type).

Understanding data types and null entries is essential before proceeding with analysis.

Code
summary_df = pd.DataFrame({
    'Column': real_estate.columns,
    'Data Type': real_estate.dtypes.values,
    'Missing Values': real_estate.isnull().sum().values
})
summary_df
Table 1: Overview of dataset columns, their data types, and the count of missing values in each column.
Column Data Type Missing Values
0 Serial Number int64 0
1 List Year int64 0
2 Date Recorded object 0
3 Town object 0
4 Address object 0
5 Assessed Value int64 0
6 Sale Amount float64 0
7 Sales Ratio float64 0
8 Property Type object 0
9 Residential Type object 0

4 Summary Statistics: Numerical Variables

Code
real_estate.describe()
Table 2: Summary statistics for numerical variables in the dataset, including count, mean, standard deviation, min, and quartile values.
Serial Number List Year Assessed Value Sale Amount Sales Ratio
count 1.439670e+05 143967.000000 143967.000000 143967.000000 143967.000000
mean 6.270426e+05 2017.567123 161573.011760 247757.685692 0.668229
std 5.434197e+06 1.124513 76604.109558 119804.305879 0.134377
min 1.610000e+02 2016.000000 1680.000000 2000.000000 0.298372
25% 1.603150e+05 2017.000000 105980.000000 160000.000000 0.584300
50% 1.709830e+05 2018.000000 147000.000000 228000.000000 0.651503
75% 1.900580e+05 2019.000000 206900.000000 320000.000000 0.734146
max 1.920000e+08 2019.000000 380370.000000 684000.000000 1.078614

This summary provides a snapshot of key distribution characteristics. We observe that:

  • Serial Number is a unique identifier, ranging widely due to the sequential nature of property listings. It is not used for statistical analysis but ensures data integrity.

  • List Year ranges from 2016 to 2019. This indicates that the dataset covers four years of listing activity, allowing for temporal trend analysis. The values are evenly distributed across these years.

  • Assessed Value ranges from 1,680 to 380,370. This variable shows some skewness toward higher property values. The average lies below the 75th percentile, indicating that while most properties are modestly valued, a few high-value properties exist and pull the mean upward.

  • Sale Amount ranges from 2,000 to 684,000 after outlier removal. Like Assessed Value, it is right-skewed, with many properties clustered around typical residential prices and a few much higher-value sales. The median is a better central measure here.

  • Sales Ratio ranges from ~0.29 to 1.07. A value of 1 represents a perfect match between sale amount and assessed value. Most values lie below 1, indicating that many properties sold for less than their assessed value. This ratio helps in evaluating market valuation accuracy across towns or years.

5 Summary Statistics: Categorical Variables

Code
real_estate.describe(include='object')
Table 3: Summary statistics for categorical variables in the dataset, including the number of unique values, most frequent category (top), and its frequency (freq).
Date Recorded Town Address Property Type Residential Type
count 143967 143967 143967 143967 143967
unique 1167 169 133891 5 5
top 7/1/2019 Bridgeport 71 AIKEN ST Single Family single-residence
freq 454 5806 24 103968 103968

This section highlights the composition and variety within the categorical features of the dataset:

  • Date Recorded spans from 2016 to 2020, capturing the transaction recording date. This field is useful for tracking seasonal or year-specific market behavior. The format is string-based, and cleaning may be required before time series analysis.

  • Town includes 169 unique town names. This variable is crucial for geographic segmentation. Some towns have a significantly higher number of property sales, which could be indicative of population density or economic activity.

  • Address is a string-based location descriptor. While useful for identifying specific properties, it often includes inconsistent formatting or missing values (e.g., ‘nan’), making it less reliable for grouping without cleaning.

  • Property Type is consistent across the dataset with a single value: Residential. This simplifies analysis, as there’s no need to disaggregate by different property purposes like commercial or industrial.

  • Residential Type includes five subtypes: Condo, Single Family, Two Family, Three Family, and Four Family. These categories reflect structural distinctions and can heavily influence sale amounts and assessment values. For example, single-family homes may generally command higher per-unit values than multi-family properties.

6 Visualisation.

6.1 Number of Properties By Sales Amount

Code
# Create the histogram
fig = px.histogram(
    real_estate,
    x='Sale Amount',
    nbins=30,
    title='Distribution of Sale Amounts',
    labels={'Sale Amount': 'Sale Amount'},
    color_discrete_sequence=['skyblue']
)

# Update layout for better visuals
fig.update_layout(
    xaxis_title='Sale Amount',
    yaxis_title='Number of Properties',
    bargap=0.1,
    template='plotly_white'
)

# Make responsive and show hover/click info
fig.show(config={"responsive": True})

fig.write_image(os.path.join(results_dir, 'Distribution_of_Sale_Amounts.jpg'))
fig.write_image(os.path.join(results_dir, 'Distribution_of_Sale_Amounts.png'))

6.1.1 Iterpretation : Distribution Of Sales Amount

The histogram above illustrates the distribution of property sale amounts across the dataset (filtered to exclude extreme outliers above $1M) ’s a detailed breakdown of what the plot reveals:

  • Skewness: The distribution is clearly right-skewed — a common pattern in real estate. Most properties are sold for relatively modest amounts, with fewer sales occurring at very high prices.

  • Mode (Most Frequent Range): The tallest bars are centered around $150,000 to $200,000, where we see the peak number of transactions (over 12,000). This suggests the majority of homes fall within this pricing range.

  • Density Curve (KDE): The smooth blue curve overlays the bars and visually reinforces the distribution shape. The sharp rise and gradual taper highlight the concentration of mid-priced properties, followed by a long tail extending toward higher prices.

  • Spread: Sale amounts range broadly from near $0 up to around $700,000, with fewer properties selling at the extremes. The lower end includes possible undervalued or distressed sales, while the upper end includes premium properties.

  • Implications: This right skew implies that using the median sale price is more appropriate than the mean, which may be inflated by a small number of expensive sales. The long tail also suggests log transformation might be useful if this variable is used in predictive modeling.

6.2 Number of Properties by Residential Type

Code
# Count the number of properties per Residential Type
res_type_counts = real_estate['Residential Type'].value_counts().reset_index()
res_type_counts.columns = ['Residential Type', 'Count']

# Create bar chart
fig = px.bar(
    res_type_counts,
    x='Residential Type',
    y='Count',
    title='Count of Each Residential Type',
    color_discrete_sequence=['#1f77b4']
)

fig.update_layout(
    xaxis_title='Residential Type',
    yaxis_title='Number of Properties',
    xaxis_tickangle=30,
    template='plotly_white'
)

# Show interactive chart
fig.show(config={"responsive": True})

# Save images
fig.write_image(os.path.join(results_dir, 'Count_of_Each_Residential_Type.jpg'))
fig.write_image(os.path.join(results_dir, 'Count_of_Each_Residential_Type.png'))

6.2.1 Interpretation: Count of Each Residential Type

The bar chart above displays the distribution of properties by Residential Type, giving us a categorical breakdown of the housing structures in the dataset.

  • Dominance of Single Family Homes: Single Family residences are the most common type in the dataset, significantly outnumbering all other categories. This reflects the typical composition of residential areas, where detached homes are the standard property form.

  • Prevalence of Condos and Two Family Units: Condo units and Two Family homes are also quite prominent, suggesting a good mix of individual and shared housing structures. Condos may be more concentrated in urban centers, while Two Family homes could reflect multi-generational housing or rental opportunities.

  • Smaller Share of Three and Four Family Homes: Three Family and especially Four Family properties are much less common. These types likely represent either older constructions, niche rental properties, or homes in specific zoning areas that allow multi-unit dwellings.

  • Insights for Market Segmentation: Understanding this breakdown is valuable for market analysis and pricing comparisons. For example, pricing strategies, maintenance costs, and investment appeal can differ greatly between Single Family and multi-family properties.

6.3 Number of Properties By Sales Ratio

Code
# Ensure numeric conversion
real_estate['Sales Ratio'] = pd.to_numeric(real_estate['Sales Ratio'], errors='coerce')

# Replace inf and -inf with NaN
real_estate = real_estate.replace([np.inf, -np.inf], np.nan)

# Drop rows with NaN in Sales Ratio
real_estate = real_estate.dropna(subset=['Sales Ratio'])
Code
# Create histogram of Sales Ratio
fig = px.histogram(
    real_estate,
    x='Sales Ratio',
    nbins=30,
    title='Distribution of Sales Ratio',
    labels={'Sales Ratio': 'Sales Ratio'},
    color_discrete_sequence=['skyblue']
)

fig.update_layout(
    xaxis_title='Sales Ratio',
    yaxis_title='Number of Properties',
    bargap=0.1,
    template='plotly_white'
)

fig.show(config={"responsive": True})

# Save images
fig.write_image(os.path.join(results_dir, 'Distribution_of_Sales_Ratio.jpg'))
fig.write_image(os.path.join(results_dir, 'Distribution_of_Sales_Ratio.png'))

6.3.1 Interpretation: Distribution Of Sales Ratio

This histogram displays the distribution of the Sales Ratio, which compares a property’s sale price to its assessed value (i.e., Sale Amount ÷ Assessed Value). Here’s what the chart reveals:

  • Peak & Central Tendency: The distribution peaks around a Sales Ratio of ~0.65, indicating that most properties were sold for about 65% of their assessed value. This suggests a tendency toward undervaluation or conservative government assessments.

  • Distribution Shape: The plot is slightly right-skewed, but generally resembles a bell-shaped (normal-like) curve. The bulk of the properties lie between 0.5 and 0.8, forming a concentrated central cluster.

  • Lower and Upper Bounds:

    • Minimum visible in this plot: around 0.3
    • Maximum visible: just above 1.0, with very few properties exceeding the assessed value
    • A Sales Ratio of 1.0 represents a perfect match between market sale and assessed value. Only a small tail of properties is observed beyond that point.
  • Market Insight: The fact that the vast majority of Sales Ratios are below 1.0 implies that properties in this dataset typically sold for less than their assessed value, which could reflect:

    • An overestimation by assessors,
    • A cooling market,
    • Or simply negotiated discounts.
  • Implication for Valuation Models: The Sales Ratio provides valuable insight into how closely government-assessed values align with actual market behavior.

  • A Sales Ratio below 1.0 (which is the majority) suggests that most properties were sold for less than their assessed value, potentially indicating that assessments are overestimated.

  • A Sales Ratio close to or above 1.0 suggests properties that sold at or above their assessed values—this may indicate strong demand or accurate valuation.

6.4 Sales by Top 20 Towns

Code
# Get top 20 towns by count
top_towns = real_estate['Town'].value_counts().nlargest(20).reset_index()
top_towns.columns = ['Town', 'Count']

# Create bar chart
fig = px.bar(
    top_towns,
    x='Town',
    y='Count',
    title='Top 20 Towns by Number of Sales',
    color_discrete_sequence=['skyblue']
)

fig.update_layout(
    xaxis_title='Town',
    yaxis_title='Number of Sales',
    xaxis_tickangle=45,
    template='plotly_white'
)

fig.show(config={"responsive": True})

# Save images
fig.write_image(os.path.join(results_dir, 'Top_20_Towns_by_Number_of_Sales.jpg'))
fig.write_image(os.path.join(results_dir, 'Top_20_Towns_by_Number_of_Sales.png'))

6.4.1 Interpretation: Top 20 Towns by Number of Sales

This bar chart displays the 20 towns with the highest number of real estate transactions. Each bar reflects the volume of property sales in a given town.

Key Observations: Bridgeport dominates the chart with nearly 6,000 sales, significantly ahead of the second-place town. This sharp lead indicates a highly active real estate market in Bridgeport.

Hamden, Waterbury, and Stamford follow with over 3,000 sales each, showing strong but more moderate activity.

From Meriden onward, the number of sales gradually decreases, but all towns shown still reflect high engagement in the market compared to the full list of towns in the dataset.

The bars are relatively evenly spaced among the rest, suggesting a moderately competitive real estate environment in those areas.

Implications: Bridgeport’s high sales volume might be attributed to factors like population size, housing turnover, or market affordability. It could be a prime candidate for targeted investment or policy focus.

Towns like Hamden, Waterbury, and Stamford may represent stable and active secondary markets, appealing to both buyers and sellers.

Local authorities and developers might consider prioritizing infrastructure, zoning, or community development efforts in the top towns to support continued growth and meet demand.

6.5 Averagae Sales Ratio By Town

Code
# Select top 5 towns by number of sales
#| echo: false
#| output: false
#| warning: false
top_towns = real_estate['Town'].value_counts().nlargest(5).index
avg_sales_ratio_top = avg_sales_ratio[avg_sales_ratio['Town'].isin(top_towns)]
Code
# Calculate average sales ratio by Town and Residential Type
avg_sales_ratio = real_estate.groupby(['Town', 'Residential Type'])['Sales Ratio'].mean().reset_index()

# Select top 5 towns by number of sales
top_towns = real_estate['Town'].value_counts().nlargest(5).index
avg_sales_ratio_top = avg_sales_ratio[avg_sales_ratio['Town'].isin(top_towns)]

# Create grouped bar chart
fig = px.bar(
    avg_sales_ratio_top,
    x='Town',
    y='Sales Ratio',
    color='Residential Type',
    barmode='group',
    title='Average Sales Ratio per Town by Residential Type (Top 5 Towns)',
    color_discrete_sequence=px.colors.sequential.Blues
)

fig.update_layout(
    xaxis_title='Town',
    yaxis_title='Average Sales Ratio',
    xaxis_tickangle=45,
    template='plotly_white',
    legend_title_text='Residential Type'
)

fig.show(config={"responsive": True})

# Save images
fig.write_image(os.path.join(results_dir, 'income_distribution_by_agegroup_bar_plot.jpg'))
fig.write_image(os.path.join(results_dir, 'income_distribution_by_agegroup_bar_plot.png'))

6.5.1 Interpretation: Average Sales Ratio per Town by Residential Type

This grouped bar chart shows the average sales ratio for different residential types across the top 5 towns: Bridgeport, Hamden, Meriden, Stamford, and Waterbury.

  • Across all towns, single-residence and triplex-residence types tend to exhibit higher average sales ratios, often exceeding 0.65.
  • Quadplex-residences generally show the lowest average sales ratios, especially in Stamford and Waterbury.
  • Hamden and Meriden stand out for having consistently higher sales ratios, suggesting stronger or more stable housing demand.

This plot reveals that both town and residential type play a role in market performance — an important insight for investors, analysts, and policymakers.

6.6 Average Sale Amount By List Year

Code
# Plot
# Group and average
avg_sale_by_year = real_estate.groupby('List Year', as_index=False)['Sale Amount'].mean()

# Create line chart with markers
fig = px.line(
    avg_sale_by_year,
    x='List Year',
    y='Sale Amount',
    title='Average Sale Amount by List Year',
    markers=True,
    line_shape='linear',
    color_discrete_sequence=['teal']
)

fig.update_layout(
    xaxis_title='List Year',
    yaxis_title='Average Sale Amount',
    xaxis=dict(tickmode='linear', tickvals=avg_sale_by_year['List Year']),
    template='plotly_white'
)

fig.show(config={"responsive": True})

# Save images
fig.write_image(os.path.join(results_dir, 'Average_Sale_Amount_by_List_Year.jpg'))
fig.write_image(os.path.join(results_dir, 'Average_Sale_Amount_by_List_Year.png'))

6.6.1 Interpretation: Actual Sale Amounts Over Time

This line chart illustrates the actual sale amounts for properties across the time range.

  • From 2016 to 2018, sales activity was relatively lower, possibly indicating fewer transactions or lower market demand.
  • A significant rise in sales amounts begins in 2018 and continues into 2019, closely aligning with the assessed value increase from the same period.
  • The post-2018 surge suggests a market shift where real buyer activity and transaction values started matching the earlier assessment growth.

This chart highlights how market realization (sales) tends to follow assessment trends, reflecting delayed responses in actual property transactions.

6.7 Average Assessed Value By List Year

Code
# Plot
# Group and average
avg_sale_by_year = real_estate.groupby('List Year', as_index=False)['Assessed Value'].mean()

# Create line chart
fig = px.line(
    avg_sale_by_year,
    x='List Year',
    y='Assessed Value',
    title='Average Assessed Value by List Year',
    markers=True,
    line_shape='linear',
    color_discrete_sequence=['teal']
)

fig.update_layout(
    xaxis_title='List Year',
    yaxis_title='Average Assessed Value',
    xaxis=dict(tickmode='linear', tickvals=avg_sale_by_year['List Year']),
    template='plotly_white'
)

fig.show(config={"responsive": True})

# Save images
fig.write_image(os.path.join(results_dir, 'Average_Assessed_Value_by_List_Year.jpg'))
fig.write_image(os.path.join(results_dir, 'Average_Assessed_Value_by_List_Year.png'))

6.7.1 Interpretation: Assessed Property Values Over Time

This line chart displays how the assessed property values changed over the analyzed period.

  • Between 2016 and 2018, the assessed values were relatively higher, indicating consistent or anticipated market value during that period.
  • From 2018 to 2019, there is a noticeable upward boost, showing a sharp increase in assessment values — possibly due to reevaluation of properties or broader economic factors.
  • This trend suggests that the market’s perceived value of properties increased before actual sales amounts caught up.

Overall, assessed values may reflect official estimates or expectations in the housing market, which can precede actual pricing behavior.

6.8 Residential Type Distribution

Code
# Plot pie chart
# Prepare data
res_cat_counts = real_estate['Residential Type'].value_counts().reset_index()
res_cat_counts.columns = ['Residential Type', 'Count']

# Create pie chart
fig = px.pie(
    res_cat_counts,
    names='Residential Type',
    values='Count',
    title='Residential Type Distribution',
    color_discrete_sequence=px.colors.sequential.Blues,
    hole=0
)

# Show values inside slices
fig.update_traces(
    textinfo='percent+label',
    textposition='inside',
    insidetextorientation='radial'
)

# Update layout: increase figure size
fig.update_layout(
    template='plotly_white',
    width=800,
    height=600
)

fig.show(config={"responsive": True})

# Save images
fig.write_image(os.path.join(results_dir, 'Residential_Type_Distribution.jpg'))
fig.write_image(os.path.join(results_dir, 'Residential_Type_Distribution.png'))

6.8.1 Interpretation: Residential Type Distribution

This pie chart illustrates the distribution of property types in the dataset.

  • Single-residence properties dominate the dataset, making up 72.2% of all records — indicating a clear market focus.
  • Appartments come next at 20.6%, followed by duplex-residences (4.7%), with triplex and quadplex residences making up the smallest shares (2.1% and 0.4%, respectively).

The visualization reveals that multi-family housing (duplex, triplex, quadplex) is relatively rare compared to individual residential units, which may reflect development patterns or buyer preferences.